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Abstract 

Inner  holes,  artifacts  and  blank  spots  are  common  in  microarray  images,  but  current  image 
analysis  methods  do  not  pay  them  enough  attention.  We  propose  a  new  robust  model-based 
method  for  processing  microarray  images  so  as  to  estimate  foreground  and  background  intensi¬ 
ties.  The  method  starts  with  a  very  simple  but  effective  automatic  gridding  method,  and  then 
proceeds  in  two  steps.  The  first  step  applies  model-based  clustering  to  the  distribution  of  pixel 
intensities,  using  the  Bayesian  Information  Criterion  (BIC)  to  choose  the  number  of  groups  up 
to  a  maximum  of  three.  The  second  step  is  spatial,  finding  the  large  spatially  connected  compo¬ 
nents  in  each  cluster  of  pixels.  The  method  thus  combines  the  strengths  of  histogram-based  and 
spatial  approaches.  It  deals  effectively  with  inner  holes  in  spots  and  artifacts.  It  also  provides 
a  formal  inferential  basis  for  deciding  when  the  spot  is  blank,  namely  when  the  BIC  favors  one 
group  over  two  or  three.  In  experiments,  our  method  had  better  stability  across  replicates  than 
a  fixed-circle  segmentation  method  or  the  seeded  region  growing  method  in  the  SPOT  software, 
without  introducing  noticeable  bias  when  estimating  the  intensities  of  differentially  expressed 
genes.  An  R  software  package  called  spotSegmentation  implementing  the  method  is  being 
made  available  through  the  BioConductor  project. 
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1  Introduction 


Microarray  technology  provides  a  useful  tool  to  assay  the  expression  of  a  large  number  of  genes 
simultaneously.  The  DNA  obtained  from  the  genes  of  interest  is  printed  on  a  glass  microscope  slide 
by  a  robotic  arrayer.  In  the  two-color  array,  the  cDNA  extracted  from  the  experimental  and  control 
samples  are  first  labelled  using  the  Cy3  (green)  and  Cy5  (red)  fluorescent  dyes.  Then  they  are  mixed 
and  hybridized  with  the  arrayed  DNA  spots.  After  hybridization,  the  arrays  are  scanned  at  the 
corresponding  wavelengths  separately  to  obtain  the  images  corresponding  to  the  two  channels.  The 
fluorescence  measurements  are  used  to  determine  the  relative  abundance  of  the  rnRNA  or  DNA  in 
the  samples.  However,  the  quantification  of  the  amount  of  fluorescence  hybridized  is  affected  by 
things  that  happen  during  the  manufacturing  of  the  arrays,  such  as  perturbations  of  spot  positions, 
irregularities  of  spot  shapes,  holes  in  spots,  unequal  distribution  of  cDNA  within  spots,  variable 
background,  and  artifacts.  Ideally  these  events  should  be  recognized  when  they  occur,  and  the 
estimated  intensities  adjusted  to  take  account  of  them. 

Several  commercial  and  research  image  processing  packages  have  been  developed  for  analyzing 
microarray  data.  For  segmentation,  the  existing  methods  can  be  grouped  into  four  categories, 
namely  fixed  circle  segmentation,  adaptive  circle  segmentation,  adaptive  shape  segmentation  and 
histogram  segmentation,  as  reviewed  by  Yang  et  al.  (2002).  Fixed  circle  segmentation  assumes 
that  the  spots  have  a  circular  shape  and  fits  a  circle  with  a  fixed  radius  to  all  the  spots.  It  was 
probably  first  implemented  in  ScanAlyze  (Eisen  1999).  Spot-on,  a  customized  software  written  at 
the  University  of  Washington  (Spot-On  Image,  developed  by  R.  E.  Bumgarner  and  Erick  Ham- 
mersmark),  also  implements  this  algorithm.  Adaptive  circle  segmentation  improves  the  method  by 
allowing  the  radius  of  the  circle  to  be  adjustable.  However,  a  circular  spot  mask  provides  a  poor 
fit  to  irregular  spots  or  donut-shaped  spots  with  inner  holes,  which  are  often  seen  in  microarray 
images. 

Several  recent  developments  belong  to  the  class  of  adaptive  shape  segmentation.  The  seeded 
region  growing  approach  (Adams  and  Bischof  1994)  is  used  to  segment  microarray  images  in  the 
SPOT  software  (Yang  et  al.  2002).  The  foreground  and  background  are  grown  from  two  initial 
seeds;  this  method  can  adapt  to  various  shapes  of  spots.  Histogram  methods  are  intensity-based, 
and  use  a  target  mask  which  is  chosen  to  be  larger  than  all  spots.  The  pixels  are  classified  as 
foreground  or  background  using  thresholds  from  the  histogram  of  pixel  values  within  the  masked 
area.  Histogram  methods  do  not  use  any  spatial  information,  and  so  the  resulting  spot  masks  are 
not  necessarily  connected.  Ahmed  et  al.  (2004)  provide  evidence  that,  although  histogram  methods 
do  not  take  spatial  aspects  into  account,  they  yield  better  intensity  estimates  than  other  methods. 

Many  current  methods  have  difficulties  with  donut-shaped  spots,  artifacts  such  as  scratches, 
and  blank  spots,  all  of  which  are  common  in  practical  microarray  work.  When  the  spot  is  donut¬ 
shaped,  many  current  methods  identify  the  outer  contour  of  the  spot  as  the  mask;  this  eads  to 
downward  bias  in  the  estimated  intensity.  Another  common  problem  is  that  a  foreground  is  always 
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generated  even  when  no  spot  is  present.  This  tends  to  inflate  the  variance  of  the  estimates. 

In  this  paper  we  propose  an  approach  to  image  segmentation  and  intensity  estimation  combining 
two  simple  steps:  model-based  clustering  of  pixel  intensities,  and  spatial  connected-component 
extraction.  We  start  by  using  a  very  simple  automatic  gridding  method.  We  then  apply  model- 
based  clustering  to  the  pixel  intensities,  which  allows  us  to  estimate  the  number  of  groups  in  the 
target  area,  and  hence  provides  a  formal  basis  for  determining  whether  or  not  a  spot  is  present, 
namely  when  the  number  of  groups  estimated  is  equal  to  one.  Our  final  spot  consists  of  the 
large  connected  components  of  the  foreground  cluster  of  pixels.  An  R  software  package  called 
spotSegmentation  implementing  the  method  is  being  made  available. 

By  itself,  model-based  clustering  of  pixels  is  a  histogranr-based  method.  Thus  our  method 
combines  the  strengths  of  the  histogram  method  documented  by  Ahmed  et  al.  (2004)  with  a 
simple  spatial  step  that  ensures  as  much  as  possible  spatial  coherence  of  the  resulting  estimated 
spots.  It  handles  donut-shaped  spots  well  because  the  estimated  spots  can  easily  be  of  this  form. 
It  deals  effectively  with  artifacts  such  as  scratches  because  they  get  classified  as  a  separate  group 
of  pixels  and  are  not  included  in  the  foreground  or  background  intensity  estimates.  Perhaps  most 
importantly,  it  deals  explicitly  with  blanks,  i.e.  locations  on  the  slide  where  there  is  no  spot;  this 
is  something  that  few  other  current  methods  do.  In  experiments  we  found  the  results  to  be  more 
stable  than  those  from  either  the  fixed-circle  method  or  the  region  seeded  growing  method,  without 
introducing  substantial  biases.  We  did  our  experiments  on  two-color  arrays,  but  the  method  is 
generic  and  can  be  extended  to  other  types  of  array. 

The  term  “model-based  segmentation”  has  been  used  to  describe  methods  based  on  the  assump¬ 
tion  that  the  areas  of  interest  follow  a  parametric  form  (e.g.  Bergemann  et  al.  2004).  Fixed-circle 
and  adaptive  circle  methods  are  of  this  type.  However,  our  method  does  not  make  this  rather 
restrictive  assumption.  Instead,  our  method  is  called  “model-based”  because  it  is  based  on  model- 
based  clustering  of  the  pixel  intensities.  It  is  very  flexible  in  terms  of  the  shape  of  spot  that  it  can 
accurately  recover. 

In  Section  2,  we  describe  our  image  segmentation  method,  including  automatic  gridding,  model- 
based  clustering  of  pixels,  spatial  connected-component  extraction,  and  final  estimation  of  fore¬ 
ground  and  background  intensities.  In  Section  3  we  give  the  results  of  applying  our  method  to 
microarray  images  from  an  HIV  infection  experiment.  The  results  are  compared  with  those  from 
fixed-circle  segmentation  as  implemented  in  Spot-on  and  seeded  region  growing  as  implemented 
in  SPOT.  In  Section  4  we  describe  the  R  software  package,  spotSegmentation,  implementing  the 
methods. 

2  Methods 

Our  method  consists  of  several  simple  steps:  automatic  gridding,  model-based  clustering  of  pixels, 
and  spatial  connected  component  extraction.  Figure  1  gives  an  outline  of  the  whole  procedure  in 
flowchart  form.  We  now  describe  each  of  the  steps  in  turn. 
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Figure  1:  Outline  of  Model-based  Segmentation. 

2.1  Automatic  Gridding 

In  order  to  segment  the  image,  we  must  first  identify  the  location  of  each  spot.  This  process  is  called 
gridding  or  addressing.  A  nricroarray  typically  consists  of  several  blocks  with  the  same  layout.  The 
print-tips  on  the  arrayer  are  normally  arranged  in  a  regular  array.  Under  perfect  conditions,  the 
spots  in  each  block  locate  in  an  evenly-spaced  lattice  corresponding  to  the  layout  of  the  print-tips. 
However,  the  variation  during  the  printing  of  the  array  will  cause  the  exact  locations  of  spots  to  vary 
from  the  prespecified  parameter.  Even  if  the  irregularities  are  slight,  they  can  result  in  significant 
irregularity  in  the  image,  and  hence  have  to  be  corrected. 

In  order  to  locate  the  spots,  we  do  not  need  to  find  their  centers,  but  rather  the  edges  of  the 
target  mask,  i.e.  the  rectangle  containing  the  spot.  As  long  as  the  rectangle  contains  only  the 
pixels  from  a  single  spot,  it  is  a  valid  target  mask. 

Our  algorithm  is  as  follows: 

•  Sum  up  the  intensities  across  the  pixels  in  each  row  and  each  column. 

•  Find  the  local  minima  of  the  summed  intensities  using  a  sliding  window  with  span  approxi- 
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mately  equal  to  the  width  of  a  typical  spot. 

This  method  is  extremely  simple  and  does  not  require  human  interaction.  The  only  control 
parameters  to  be  specified  are  the  number  of  spots  in  each  row  or  column  (as  specified  by  the  array 
manufacturer),  and  the  size  of  the  sliding  window.  A  crude  estimate  using  the  known  number  of 
rows  and  columns  suffices  for  the  window  size  in  the  arrays  we  have  used. 

Figures  2  and  3  show  the  results  of  applying  the  method  to  a  12  x  32  subarray  from  the  HIV 
experiment  dataset  we  will  describe  later.  Figure  2  shows  the  summed  intensities;  the  valleys 
correspond  to  the  grid  lines.  Figure  3  shows  the  resulting  grids;  this  captures  the  locations  of  the 
spots  well. 


(a)  Row  (b)  Column 

Figure  2:  Row  (column)  Sum  of  Intensity  and  Grid  on  a  12  x  32  Subarray.  Because  the  spots 
are  located  loosely  on  a  rectangle  grid,  the  row  (column)  sums  present  a  peak-valley  pattern,  with 
peaks  corresponding  to  the  average  center  of  spots  on  the  row  (column)  and  valleys  the  delimiters 
of  spots.  Grids  are  placed  at  the  valleys  of  the  curves. 

2.2  Model-Based  Clustering  of  Pixels 

The  gene  expression  level  is  proportional  to  the  pixel  intensities  of  a  spot.  Thus  pixels  that  are 
in  the  spot  or  foreground  should  have  similar  intensities,  and  pixels  that  are  in  the  background 
should  also  have  similar  intensities.  In  addition,  pixels  that  belong  to  an  artifact  such  as  a  scratch 
that  is  neither  spot  nor  background  will  tend  to  have  intensities  that  are  different  from  either. 

As  a  result,  clustering  the  pixel  intensities  makes  sense  as  an  approach  to  segmentation;  this  is 
the  idea  underlying  histogranr-based  methods.  Here  we  apply  model-based  clustering  to  the  pixel 
intensities. 

In  model-based  clustering,  data  are  viewed  as  coming  from  a  mixture  density  f(x)  =  Yhk=i  nkfk(x)- 
Here,  ir^s  are  the  mixing  proportions  (0  <  Tik  <  1  for  all  k  =  1, ... ,  K  and  Ylk^k  =  1),  and  fk  is 
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Figure  3:  Array  image  with  grids  overlaid. 


the  density  of  the  observations  in  group  k.  In  the  Gaussian  mixture  model,  each  component  k  is 
modeled  by  the  multivariate  normal  distribution  with  mean  / 1 and  covariance  matrix 


,  /ifc  ,  Xfc) 


exp -  pk)} 
v/det(27r  Xfe) 


(1) 


The  likelihood  of  the  data  for  a  Gaussian  mixture  with  K  mixture  components  is 


n  K 

Lmix  =  |  |  ^  ^  ^fc) •  (2) 

z=l 


For  reviews  of  model-based  clustering,  see  McLachlan  and  Peel  (2000),  Fraley  and  Raftery  (2002). 

For  a  fixed  number  of  clusters  K ,  the  model  parameters  pk,  pj.,  and  can  be  estimated  using 
the  EM-algorithm  hierarchical  model-based  clustering  step  (Dasgupta  and  Raftery  1998;  Fraley  and 
Raftery  1998).  The  number  of  groups,  K,  can  be  estimated  by  maximizing  the  Bayesian  Informa¬ 
tion  Criterion  (BIC).  Model-based  clustering  is  implemented  in  the  MCLUST  software  (Fraley  and 
Raftery  1999;  Fraley  and  Raftery  2003),  which  is  available  at  http://www.stat.washington.edu/mclust 
or  http://cran.us.r-project.org. 

To  combine  the  signals  from  the  two  channels,  the  red  and  green  intensities  are  summed.  In¬ 
spection  of  the  resulting  histograms,  such  as  that  in  Figure  4,  suggest  that  it  is  reasonable  to  assume 
that  the  distribution  of  the  summed  intensities  is  approximately  a  mixture  of  Gaussian  densities. 
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sum  of  intensities 


Figure  4:  Distribution  of  spot  intensity 

We  have  some  prior  information  about  the  number  of  groups  of  pixels  present  in  the  images. 
Typically,  background  pixels  would  be  one  group  and  pixels  in  the  spot  or  foreground  would  be 
another.  In  addition,  if  an  artifact  is  present,  or  if  the  spot  is  donut-shaped  and  has  an  inner 
hole,  the  corresponding  pixels  would  form  a  third  group.  Thus  in  most  cases  we  would  expect  the 
number  of  groups,  K ,  to  be  at  most  3.  We  use  BIC  to  choose  I\ ,  but  restrict  the  possible  choices 
to  I\  <  3. 

We  have  three  cases:  K  =  1,  K  =  2  and  K  =  3.  The  case  K  =  1  would  correspond  to  the 
situation  where  there  is  no  spot,  i.e.  a  blank,  and  our  method  provides  a  principled  statistical  basis 
for  detecting  this  situation.  The  case  K  =  2  would  arise  in  the  typical  situation  where  there  is  a 
spot,  with  background.  And  K  =  3  would  be  chosen  when  there  is  a  spot  and  an  artifact  or  an 
inner  hole. 

2.3  Spatial  Connected  Component  Extraction  and  Intensity  Estimation 

Artifacts  often  take  the  form  of  small  disconnected  groups,  and  so  a  threshold  on  the  size  of  the 
connected  components  in  a  cluster  can  identify  clusters  formed  by  artifacts  in  many  cases.  We  apply 
a  4-neighbor  connected  component  labeling  procedure  (Haralick  and  Shapiro  1992)  to  the  clusters 
to  divide  them  into  spatially  connected  components.  We  retain  only  the  connected  components 
that  are  a  given  threshold  in  size  and  discard  the  other  components.  The  default  threshold  we 
use  is  100  pixels,  which  is  about  one-sixth  of  the  typical  size  of  a  spot  on  the  arrays  used  in  our 
examples. 

The  brightest  and  darkest  clusters  passing  the  threshold  are  classified  as  foreground  and  back¬ 
ground,  respectively.  If  only  one  cluster  passes  the  threshold,  we  conclude  that  there  is  no  spot 
and  that  the  location  is  blank.  Our  estimate  of  foreground  intensity  in  the  Cy3  channel  is  the 
mean  of  the  pixels  in  the  foreground  cluster.  Similarly  for  the  Cy5  channel  foreground,  where  the 
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same  pixels  are  in  the  cluster  for  both  channels.  The  background  intensities  of  the  two  channels 
are  estimated  in  the  same  way. 

In  this  way,  we  leave  out  the  disconnected  pixels  for  intensity  estimation.  Also,  when  three 
clusters  are  identified,  we  also  exclude  the  intermediate  cluster  of  pixels,  which  often  consists  of 
“suspicious”  pixels,  such  as  an  inner  hole,  an  artifact,  or  a  blurry  rim. 

The  estimated  signal  is  Is  =  1^  —  Ib,  where  P  and  Ib  are  the  mean  intensities  of  the  foreground 
and  background,  respectively.  The  true  signal  is  always  nonnegative,  but  occasionally  the  estimated 
signal,  Is,  is  negative.  In  this  case,  it  is  reasonable  to  assume  that  the  true  intensity  is  small  but 
positive.  When  this  happens,  we  therefore  set  Is  to  be  equal  to  the  5th  percentile  of  the  spot  signals 
on  the  array. 

3  Results 

We  applied  our  proposed  method  to  several  microarrays  which  had  been  produced  to  identify  the 
genes  differentially  expressed  in  HIV  infected  cells.  The  expression  levels  of  4068  cellular  RNA 
transcripts  were  assessed  in  CD4-T-cell  lines  at  24  hours  after  infection  with  HIV  virus  type  1. 
Here  we  consider  4  replicate  subarrays,  each  consisting  of  12  x  32  =  384  genes,  including  three 
HIV-1  genes  used  as  positive  controls.  All  the  four  replicates  shared  the  same  DNA  samples.  Two 
of  the  replicates  were  from  a  dye-swap  experiment  in  which  the  dyes  were  switched  between  the 
two  channels;  this  can  be  helpful  for  canceling  out  the  dye-binding  effects.  Further  details  can  be 
found  in  the  original  paper  (van’t  Wout  et  al.  2003).  The  image  files  can  be  found  at 
http://expression.microslu.washington.edu/expression/vantwoutjvi2002.html.  In  these  microarrays, 
a  large  number  of  spots  have  donut  shapes  with  one  or  more  holes  in  them.  We  compare  our  method 
with  two  other  methods  representative  of  the  range  of  methods  available:  the  well-known  software 
package  SPOT,  which  segments  using  seeded  region  growing,  and  a  customized  software  written  at 
the  University  of  Washington  (Spot-On  Image,  developed  by  R.  E.  Bumgarner  and  Erick  Hammers- 
rnark),  which  implements  fixed  circle  segmentation  and  estimates  background  using  four  smaller 
circles  in  the  corners  of  the  rectangle. 

Figures  5-10  show  the  results  for  different  individual  spots.  Figure  5  shows  an  ideal  case  with 
a  single  regularly  shaped  spot  and  no  artifacts.  There  SPOT  and  our  method  both  perform  well, 
but  the  fixed-circle  method  of  Spot-on  is  inaccurate,  missing  some  of  the  spot  and  including  some 
of  the  background  in  it. 

Figure  6  is  an  example  of  a  donut-shaped  spot.  The  fixed-circle  method  again  misrepresents 
the  shape  of  the  spot.  The  seeded  region  growing  method  of  SPOT  takes  the  spot  to  be  all  the 
pixels  inside  the  outer  contour  of  the  donut  shape.  This  includes  the  inner  hole,  which  is  darker 
than  the  spot,  and  so  may  lead  to  intensity  estimates  that  are  biased  downwards.  Our  method 
correctly  identifies  the  donut  shape.  The  inner  hole  is  identified  as  a  third  cluster,  and  the  pixels 
in  the  inner  hole  are  not  included  in  the  calculation  of  either  foreground  or  background  intensity. 

Figure  7  shows  a  different  kind  of  donut  shape,  with  a  small  inner  hole  that  is  brighter  than  the 
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(a)  Our  method 


(b)  Spot 


(c)  Spot-on 


Figure  5:  Segmentation  results:  Good 


(a)  Our  method  (b)  Spot  (c)  Spot-on 

Figure  6:  Segmentation  results:  Donut 

spot.  The  fixed-circle  method  again  does  not  perform  well.  SPOT  identifies  the  small  inner  hole  as 
the  spot.  Our  method,  in  contrast,  identifies  the  donut  shape  of  the  spot.  The  inner  hole  is  brighter 
than  the  main  body  of  the  spot,  and  it  is  identified  as  a  third  cluster,  but  it  is  not  identified  as  the 
foreground  because  it  is  too  small  to  pass  the  threshold  of  100  pixels. 

Figure  8  includes  several  small  artifacts,  one  or  two  inside  the  spot,  one  at  the  bottom,  and 
perhaps  one  on  the  left.  The  fixed-circle  segmentation  includes  most  of  the  artifacts  in  the  spot. 
SPOT  includes  the  inner  artifacts  in  the  spot,  and  misses  part  of  the  spot.  Our  method  finds  the 
shape  of  the  spot  correctly  and  excludes  the  artifacts. 

Figure  9  shows  a  blank  spot  with  a  small  artifact.  The  fixed-circle  method  finds  a  spot  anyway. 
SPOT  also  finds  an  oddly-shaped  area  which  does  not  correspond  to  any  real  spot.  Our  method 
correctly  infers  that  there  is  no  spot  in  this  rectangle.  In  fact,  BIC  chose  K  =  2  in  this  case,  with 
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(a)  Our  method  (b)  Spot  (c)  Spot-on 

Figure  7:  Segmentation  results:  Donut 


(a)  Our  method  (b)  Spot  (c)  Spot-on 

Figure  8:  Segmentation  results:  Artifacts 

the  second  cluster  being  the  small  artifact,  but  it  was  excluded  because  it  was  too  small,  below  the 
100-pixel  threshold. 

Figure  10  shows  another  blank  spot.  Again,  the  fixed-circle  method  and  SPOT  identify  areas 
which  do  not  correspond  to  any  real  spot,  while  our  method  concludes  that  no  spot  is  present.  In 
this  case,  BIC  chose  K  =  1,  so  the  conclusion  that  there  is  no  spot  present  was  clear-cut. 

We  now  turn  to  more  global  evaluation  of  the  different  methods.  Figure  11  displays  the  seg¬ 
mentation  results  of  part  of  the  subarray  using  our  approach  (left  column),  and  the  results  from 
SPOT  (right  column).  Our  criterion  is  stability  of  estimated  expression  levels  across  replicates.  We 
evaluate  the  stability  of  intensity  estimation  as  the  variation  in  the  logratio  estimate,  l  =  log2 /1//2, 
across  replicates,  where  1 1  and  I2  are  the  signal  estimates  from  channel  1  and  2,  respectively,  as 
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(a)  Our  method 


(b)  Spot 


(c)  Spot-on 


Figure  9:  Segmentation  results:  Blank 


(a)  Our  method  (b)  Spot  (c)  Spot-on 

Figure  10:  Segmentation  results:  Blank 

defined  in  Section  2.3.  Stability  is  measured  by  the  sum  of  squared  differences,  defined  as 

SSD  =  -  k)\  (3) 

where  N  is  the  total  number  of  spots  on  the  array,  R  is  the  total  number  of  replicates,  is  the 
log  ratio  for  the  zth  spot  on  the  rth  replicate,  and  lt  is  the  mean  of  the  logratio  across  all  replicates 
for  the  zth  spot.  If  no  foreground  is  identified,  I1/I2  is  set  to  1.  We  apply  median  normalization  to 
our  estimate  as  well  as  the  estimates  from  SPOT  and  Spot-on  before  calculating  the  log-ratios. 

Table  1  shows  the  comparison  of  stability  between  the  three  methods.  Our  method  demonstrates 
better  stability  than  both  the  fixed-circle  method  of  Spot-on  and  the  seeded  region  growing  method 
of  SPOT.  In  the  12  x  32  array,  the  SSD  of  our  method  was  51.5%  lower  than  that  of  the  fixed-circle 
method  and  20.6%  lower  than  that  of  SPOT. 

A  good  method  not  only  has  less  variation,  but  also  does  not  bias  the  estimated  expression 
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(a)  Model-based  segmentation  (b) 

Figure  11:  Segmentation  results  for  a  12  x 


Seeded  region  growing 
8  subset  of  the  array. 


13 


Table  1:  Sum  of  Squared  Differences  of  Estimates  of  Logratios  for  384  Genes  Across  4  Replicates. 


SSD 

%reduction 

Model-based  segmentation 

657.02 

- 

Seeded  region  growing 

826.99 

20.6% 

Fixed  circle 

1354.10 

51.5% 

levels  of  highly  expressed  genes  downwards.  A  method  could  achieve  small  variation  by  reducing 
the  estimated  expression  levels  of  all  genes,  including  those  that  are  differentially  expressed,  but 
this  would  not  be  a  very  useful  method.  Because  HIV  genes  are  present  only  in  the  HIV  infected 
sample  and  are  highly  expressed  in  the  HIV  infected  sample,  they  can  be  used  as  positive  controls  to 
check  whether  estimated  expression  levels  of  highly  expressed  genes  are  biased  downwards.  There 
are  three  HIV  genes  in  the  subarray.  Table  2  shows  the  average  of  the  estimated  logratio  across  the 
four  replicates  for  these  three  genes.  The  estimates  from  our  method  are  very  close  to  those  from 
seeded  region  growing.  They  are  a  little  smaller  than  those  from  fixed-circle  segmentation,  but  this 
is  so  much  more  unstable  than  our  method  that  this  does  not  seem  to  be  of  great  concern. 

Table  2:  Average  Logratios  for  the  3  HIV  Genes  across  Replicates.  The  logratios  (base  2)  are 
median  normalized.  _ 


HIV1 

HIV2 

HIV3 

Model-based  segmentation 

-10.46 

-11.62 

-11.03 

Seeded  region  growing 

-10.33 

-10.10 

-10.50 

Fixed  circle 

-12.81 

-12.75 

-12.87 

As  a  final  assessment,  we  carried  out  a  subjective  evaluation  of  whether  the  method  was  suc¬ 
cessfully  identifying  blank  spots  (i.e.  genes  that  were  not  expressed  on  the  microarray).  Human 
eyes  typically  segment  images  better  than  machine  vision,  and  so  we  compared  the  results  from  our 
automatic  computer  method  with  those  from  a  subjective  evaluation  by  one  of  the  present  authors, 
acting  as  a  human  subject.  The  subject  examined  the  raw  images  of  four  replicates  of  a  12  x  8 
subarray  without  prior  knowledge  of  the  machine  segmentation,  and  coded  the  resulting  384  spots 
into  one  of  three  classes: 

•  Not  expressed:  No  visible  spots  in  both  channels; 

•  Questionable:  No  visible  spots  in  one  channel  and  questionable  in  the  other  channel,  or 
questionable  in  both  channels; 

•  Expressed:  otherwise 

Table  3  shows  the  cross-classification  of  the  subjective  decision  and  the  segmentation  using  our 
method.  The  agreement  is  quite  close:  85%  for  genes  not  expressed,  and  98%  for  expressed  genes. 
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In  addition,  of  the  genes  about  which  the  human  subject  was  unsure,  our  automatic  method  split 
them  fairly  evenly,  identifying  42%  as  not  expressed  and  58%  as  expressed. 


Table  3:  Cross-Classification  of  Subjective  against  Automated  Assessments  of  96  genes  in  4  repli¬ 
cates.  _ 


Automated  Decision 

Not  Expressed 

Expressed  %  Agreement 

Subjective 

Not  Expressed 

23 

4  85 

Decision 

Questionable 

11 

15 

Expressed 

5 

326  98 

4  Software 

The  methods  described  in  this  paper  are  implemented  in  the  R  language  contributed  package 
spotSegmentation.  The  software  consists  of  two  basic  functions:  spotgrid,  which  determines 
rectangles  within  cDNA  microarray  slides  in  which  individual  spots  are  located,  and  spotseg, 
which  determines  foreground  and  background  signals  within  the  spots. 

The  spotgrid  function  is  used  to  divide  a  microarray  image  block  into  a  grid  separating  the 
individual  spots.  It  takes  as  input  the  intensities  from  the  two  channels,  along  with  the  known 
numbers  of  rows  and  columns  of  spots  within  a  block  on  a  slide.  The  output  is  the  row  and  column 
locations  defining  a  grid  separating  the  individual  spots.  There  is  an  option  to  display  the  grid 
with  the  image  superimposed. 

Individual  spots  can  be  segmented  using  the  function  spotseg.  It  takes  as  input  the  intensities 
from  the  two  channels,  along  with  the  row  and  column  delimiters  of  the  spots  within  a  block  on 
a  slide  (e.g.,  as  determined  by  spotgrid).  There  is  an  option  to  display  the  various  stages  of 
the  segmentation  process  for  individual  spots,  as  well  as  to  display  the  entire  block  of  segmented 
spots  at  the  end  of  processing.  Mean  and  median  pixel  intensities  for  the  foreground  and  back¬ 
ground  for  each  channel  and  each  spot  can  be  recovered  through  the  summary  function  applied 
to  the  output  of  spotseg.  The  spotseg  function  requires  the  MCLUST  package  (http://cran.r- 
project.org/src/contrib/PACKAGES.html)  for  the  clustering  phase. 

The  spotSegmentation  package  is  being  made  available  through  the  BioConductor  project 
(see  http:/ /www.bioconductor.org). 

5  Discussion 

We  have  described  a  two-step  method  for  segmenting  microarray  images  and  estimating  intensities. 
The  two  steps  are  model-based  clustering  of  pixel  intensities,  and  spatial  connected  component 
extraction.  Each  of  the  steps  is  simple  to  implement.  The  method  provides  a  principled  statistical 
basis  for  determining  whether  or  not  a  gene  is  expressed  at  a  spot,  and  thus  deals  explicitly  with 
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blank  spots.  It  also  deals  effectively  with  donut-shaped  spots  with  inner  holes  and  with  artifacts.  In 
experiments  it  yielded  results  that  were  more  stable  across  replicates  than  fixed-circle  segmentation 
or  the  seeded  region  growing  method  implemented  in  the  SPOT  software,  without  introducing 
noticeable  biases  in  the  estimated  expression  levels  of  differentially  expressed  genes.  We  are  making 
available  a  software  package  in  R,  called  spotSegmentation,  implementing  the  method. 

Recently,  some  clustering  methods  have  been  used  for  image  processing  of  nricroarray  data 
(Bozinov  and  Rahnenfrihrer  2002;  Nagarajan  2003;  Glasbey  and  Ghazal  2003).  Bozinov  and  Rah- 
nenfrihrer  (2002)  used  &-means  and  Partitioning  Around  Medoids  (PAM)  on  the  two-dimensional 
vectors  of  the  intensities,  and  Rahnenfiirer  and  Bozinov  (2004)  improve  on  this  by  considering  only 
the  pixels  within  the  average  spot  shape,  which  turns  out  to  be  almost  exactly  a  circle.  Nagara¬ 
jan  (2003)  used  the  same  method,  but  only  on  the  intensities  from  the  green  channel.  Glasbey 
and  Ghazal  (2003)  considered  a  Gaussian  mixture  model  for  the  two-dimensional  vector  of  the 
square  root  of  the  intensities.  All  of  these  methods  consider  only  two  clusters.  As  Bozinov  and 
Rahnenfrihrer  (2002)  pointed  out  and  we  also  observed,  fixing  the  number  of  clusters  to  two  can 
mislead  the  clustering  algorithm  into  taking  the  large  brighter  artifacts  as  foreground  and  combin¬ 
ing  the  dimmer  spot  pixels  into  the  background.  It  also  excludes  the  ability  to  formally  identify 
blank  spots  provided  in  our  method  by  the  data-based  choice  of  the  number  of  clusters.  Antonio 
et  al.  (2004)  used  a  clustering  method  that  does  not  constrain  the  number  of  clusters,  but  in  the 
experiments  reported  it  did  not  exclude  the  inner  holes  of  donut-shaped  spots. 

For  segmentation,  QuantArray  (GSI  Luminomics  1999)  applies  a  threshold  to  the  histogram  of 
pixel  values  in  a  target  region  around  a  spot,  ScanAlyse  (Eisen  1999)  uses  a  circle  of  fixed  radius, 
GenePix  (Axon  Instruments  Inc.  1999)  uses  a  circle  with  adaptive  radius,  and  UCSF  Spot  (Jain 
et  al.  2002)  uses  histogram  information  within  a  circle.  Liew  et  al.  (2003)  use  an  adaptive  circle 
method,  while  Bergemann  et  al.  (2004)  generalize  this  by  using  an  adaptive  ellipse.  They  flag  inner 
holes,  but  the  user  has  to  decide  what  to  do  with  them  when  estimating  intensities. 

Steinfath  et  al.  (2001)  fitted  a  scaled  bivariate  Gaussian  distribution  to  pixel  values,  but  using 
a  robust  fitting  method.  Brandle  et  al.  (2003)  described  a  robust  fitting  for  the  Gaussian  spot 
model  using  an  M-estimator.  Schadt  et  al.  (1999)  proposed  an  adaptive  pixel  selection  algorithm 
to  remove  pixels  contaminated  by  noise. 

Kim  et  al.  (2001)  used  an  edge  detection  method.  They  were  aware  of  the  problem  of  inner 
holes  and  used  a  threshold  of  intensity  to  decide  the  eligibility  of  pixels  as  foreground.  Hirata  et  al. 
(2002)  and  Angulo  and  Serra  (2003)  used  mathematical  morphology.  Their  methods  can  deal  with 
blank  spots,  but  not  with  spots  with  inner  holes.  Glasbey  and  Ghazal  (2003)  used  a  combinatorial 
way  to  consider  a  variety  of  methods,  including  fixed  circle,  proportions  of  histogram,  k-means 
clustering  with  different  preprocessing  and  different  parameters.  O’Neill  et  al.  (2003)  recreate  the 
background  slide  and  subtract  it.  Their  method  deals  effectively  with  global  artifacts  that  involve 
a  substantial  number  of  spots,  but  not  with  inner  holes  or  more  local  artifacts. 

Automatic  gridding  is  necessary  before  applying  our  method,  and  any  automatic  gridding 
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method  could  be  used  in  combination  with  our  segmentation  approach.  We  have  developed  a  very 
simple  technique  for  gridding,  which  is  included  in  the  spotSegmentation  software.  More  complex 
automatic  gridding  methods  have  been  proposed  by  Jung  and  Cho  (2002)  who  use  nearest-neighbor 
graph  methods,  Galinsky  (2003)  who  uses  Voronoi  diagrams,  and  by  Katzer  et  al.  (2003)  who  use 
a  Markov  random  field  approach,  as  well  as  by  authors  of  several  of  the  more  comprehensive  seg¬ 
mentation  methods  mentioned  above.  Our  automatic  gridding  method  is  much  simpler,  and  we 
have  found  it  to  be  effective. 
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